To all whom it may concern:
YIN ZHANG

have invented certain new and useful method and apparatus for 

SCALABLE ATOMIC MULTICAST 

of which the following is a full, clear and exact description.

SCALABLE ATOMIC MULTICAST (SAM)

CROSS REFERENCE TO RELATED APPLICATION 


This application claims the priority benefits of copending U.S. Provisional Application No. 
60/098,065, filed on August 27, 1998. 

TECHNICAL FIELD 

This invention relates to communication protocols for distributed networks requiring state
synchronization between nodes.

BACKGROUND OF THE INVENTION

Many networking applications involve large numbers of end-points, i.e., nodes. Such
applications may require their components to synchronize states reliably in a highly distributed
environment. Well known examples of the problem include enforcing consistency in a distributed
database and maintaining cache coherency in a distributed multiprocessor environment. Although
the problem has existed for a long time, the recent exponential growth of the Internet and the
proliferation of Internet-related applications bring it to the foreground and underscore the need
for more efficient solutions. Moreover, Internet-related applications can be distributed over
thousands of end-points and often operate in real time, complicating straightforward extension of
previously known methods.

As one illustration, let us take a brief look at the architecture of a network router. 






Express Mail No.: EL320480176US 

A typical router has a single router controller managing multiple forwarding devices. A
single router controller can easily become a performance bottleneck; it is also a single point of
failure. High performance routers of the next generation will likely include multiple router
controllers working in parallel. In such an architecture, consistency must be maintained over all
router controllers' forwarding tables. This is a classical state synchronization requirement.

Such a synchronization requirement can be naturally supported by an atomic multicast service,
which ensures both atomicity and total ordering over all messages sent to the multicast group. By
atomicity we mean that any message sent to the group is delivered either to all or to none of the
operational group members. Total ordering means that messages are delivered in the same order at
all such group members. Note that here we use "delivered" rather than "received." By delivered we
mean that a message is passed to applications sitting on top of the atomic multicast service. The
order in which messages get delivered can be different from the order in which they are received.

As a fundamental abstraction for building distributed reliable applications, atomic multicast
has been widely studied in the field, and has actually been implemented in a number of working
systems, such as Isis and Horus. Below we present a brief overview of the previous work.

Isis ABCAST Algorithm 

The Isis system is one of the pioneering protocols that support atomic multicast. Isis is
described in K.P. Birman et al., Lightweight Causal and Atomic Group Multicast, ACM
TRANSACTIONS ON COMPUTER SYS., August 1991, and in K.P. Birman & T. Joseph, Reliable
Communication in the Presence of Failures, ACM TRANSACTIONS ON COMPUTER SYS., February
1987. Both above-mentioned articles are hereby incorporated by reference as if fully set forth
herein.

The Isis ABCAST primitive achieves atomicity and total ordering based on a three-way
commit protocol. To send a message from a client/sender, the following steps are performed:

1. A sender transmits the message to all of its destinations.

2. Upon receipt of the message, each recipient assigns it a priority number larger than
the priority of any message received but not yet delivered; the recipient then informs
the sender of the priority it assigned to the message.

3. The sender collects responses from the recipients that remain operational, computes
the maximum value of all the priorities it has received, and sends this value back to
all the recipients.

4. The recipients change the priority of the message to the value received from the sender;
they can then deliver messages in order of increasing priority.
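
The four steps above can be illustrated with a small single-process Python sketch. It is not taken from the Isis publications: the names Recipient and abcast are hypothetical, network transport and failures are omitted, and ties between equal final priorities are broken by message ID, which is an assumption made here so that the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Recipient:
    """One ABCAST destination (illustrative model, not the Isis code)."""
    highest_priority: int = 0                      # largest priority seen so far
    pending: dict = field(default_factory=dict)    # msg_id -> [priority, is_final]
    delivered: list = field(default_factory=list)  # msg_ids in delivery order

    def propose(self, msg_id):
        # Step 2: propose a priority larger than any priority seen so far.
        self.highest_priority += 1
        self.pending[msg_id] = [self.highest_priority, False]
        return self.highest_priority

    def finalize(self, msg_id, final_priority):
        # Step 4: adopt the sender's maximum, then deliver whatever is safe.
        self.pending[msg_id] = [final_priority, True]
        self.highest_priority = max(self.highest_priority, final_priority)
        while self.pending:
            # Assumed tie-break: equal priorities are ordered by message ID.
            head = min(self.pending, key=lambda m: (self.pending[m][0], m))
            if not self.pending[head][1]:
                break  # the lowest-priority message is not final yet
            self.delivered.append(head)
            del self.pending[head]


def abcast(msg_id, recipients):
    # Steps 1 and 3: send, collect the proposals, return the maximum to all.
    final_priority = max(r.propose(msg_id) for r in recipients)
    for r in recipients:
        r.finalize(msg_id, final_priority)
    return final_priority
```

Calling propose and finalize by hand, with two messages arriving at two recipients in opposite orders, shows that a message is held back until every message with a lower priority has become final, so both recipients end up with the same delivery order.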

A number of factors contribute to the poor scalability of Isis. First, to send a message, the
sender has to block until the communication completes. During this period, no other message can
be sent. This means that the performance of the entire multicast group is limited by the slowest
receiver.

Second, explicit knowledge of group membership is required to ensure reliability. The
management of group membership is expensive. Moreover, whenever the group membership
changes, the entire group has to block until every member has installed the new view of the group
membership. This is undesirable in many cases. For example, in the router context mentioned
above, new router controllers are added when the system load is high. Blocking the entire controller
group can easily cause disastrous network congestion in this case.

Finally, the overhead for sending a message is relatively high. For each multicast message,
three communication steps are required to ensure the proper delivery of the multicast message, even
if the communication channel is perfect and no group member fails. Furthermore, an overhead of
2n total messages is involved in the best case, where n is the group size.

For all these reasons, the ABCAST algorithm typically cannot scale to more than 100
members.

Sequencer-Site Algorithms

This class of algorithms is described in, inter alia, M.F. Kaashoek et al., An Efficient Reliable
Broadcast Protocol, OPERATING SYS. REV., October 1989, hereby incorporated by reference as if
fully set forth herein. Sequencer-site algorithms achieve total ordering by using an elected process
- a sequencer - responsible for assigning sequence numbers to all multicast messages and then
multicasting the messages to the entire group. This algorithm requires a single communication step
in the optimal case where the sequencer is also the source of the message, and two steps in all other
cases. Because of the high load on the sequencer, the algorithm is considered non-scalable even for
medium size systems.

Rotating-Token Algorithms

These algorithms are described in the following sources:
(1) Y. Amir et al., The Totem Single-Ring Ordering and Membership Protocol, ACM
TRANSACTIONS ON COMPUTER SYS., November 1995; (2) J.M. Chang and N. Maxemchuk, Reliable
Broadcast Protocols, ACM TRANSACTIONS ON COMPUTER SYS., August 1984; (3) Robbert van
Renesse et al., Horus: A Flexible Group Communications System, COMM. OF ACM, April 1996; and
(4) L.E. Moser et al., Extended Virtual Synchrony, IEEE 14th Int'l Conf. on Distributed Computing
Sys., June 1994. These articles are hereby incorporated by reference as if fully set forth herein.
The algorithms in this class are similar to the sequencer-site algorithms, but they rotate the
role of the sequencer, i.e., pass the token, among several processes. Thus, before any message can
be sent, the sender has to acquire a "token." The token-holder then places a sequence number on
each message it multicasts, and messages that arrive out of sequence are delayed until they can be
delivered in order. The rotating-token algorithm alone cannot guarantee message atomicity. It is
usually combined with knowledge of group membership to achieve atomic multicast.

Rotating-token algorithms provide load balancing and avoid network contention when shared
links are used, as is the case, for instance, in Ethernet-based LANs. Unfortunately, token
management usually involves substantial overhead. In addition, in the worst case, a
client-sender may need to wait for a complete rotation of the token before it can send any
messages. This can lead to excessive latency.

Symmetric Algorithms

These algorithms are based on Lamport's total order algorithm, described in L. Lamport,
Time, Clocks, and the Ordering of Events in a Distributed System, COMM. OF ACM, July 1978. See
also L. Rodrigues et al., Totally Ordered Multicast in Large-Scale Systems, IEEE 16th Int'l Conf.
on Distributed Computing Sys., May 1996. The Lamport and Rodrigues articles are hereby
incorporated by reference as if fully set forth herein.

In this scheme, data messages are delivered according to the order defined by the timestamps
assigned at multicast time. In order to be live, algorithms in this class require correct processes to
multicast messages periodically. Alternatively, an additional communication step is required. Total
order can be established in a single communication step when all processes broadcast
simultaneously, and in two steps in all other cases. Unfortunately, in such symmetric algorithms,
all group members are involved in the communication. This means that the entire system has to
cater to the slowest member.

Chandra and Toueg's Algorithm 

This algorithm requires two steps: (1) reliably broadcasting a message; followed by (2)
execution of a consensus. See T.D. Chandra & S. Toueg, Unreliable Failure Detectors for Reliable
Distributed Systems, J. OF ACM, March 1996. This article is hereby incorporated by reference as
if fully set forth herein.

The consensus algorithm is based on a failure detector (◇S) that requires three
communication steps; thus, in the best case, a total of four communication steps are required to run
the total order broadcast algorithm. The Chandra-Toueg algorithm requires (n-1)^2 messages for the
first step (reliable broadcast), and (2(n-1) + (n-1)^2) messages for the second step (consensus
execution), for a total of (2(n-1)^2 + 2(n-1)) messages, where n is the multicast group size. Clearly,
this quadratic dependence on group size scales poorly.
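
To make the growth rates concrete, the message counts quoted above for Chandra and Toueg's algorithm and for Isis ABCAST can be evaluated for a few group sizes. The short Python sketch below only restates the formulas given in this section; the function names are hypothetical.

```python
def chandra_toueg_messages(n):
    """Total messages for one broadcast in a group of size n, per the text."""
    reliable_broadcast = (n - 1) ** 2            # first step
    consensus = 2 * (n - 1) + (n - 1) ** 2       # second step
    return reliable_broadcast + consensus        # equals 2(n-1)^2 + 2(n-1)


def isis_abcast_messages(n):
    """Best-case message overhead quoted above for Isis ABCAST."""
    return 2 * n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, isis_abcast_messages(n), chandra_toueg_messages(n))
```

For n = 10 the two formulas give 20 and 180 messages respectively; for n = 100 they give 200 and 19,800, which illustrates why the quadratic term dominates as the group grows.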

OBJECT OF THE INVENTION 

Accordingly, the object of the present invention is to provide a robust atomic multicast
communication protocol with good scaling properties.

SUMMARY OF THE INVENTION

In order to accomplish the aforementioned object of the invention, the inventive steps for
processing a data message sent to members of a multicast group include:

1. Each data server that receives the data message requests the sequencer to assign a
sequence number to the data message;

2. After the sequencer receives a predetermined number of requests to assign a sequence
number to the data message, the sequencer assigns the next sequence number to the message and
submits this sequence number to commit servers;

3. After a commit server receives the assigned sequence number, it sends to the
sequencer an acknowledgment of receipt of the sequence number;

4. Once the sequencer has received a predetermined number
of acknowledgments of receipt of the sequence number assigned, it commits, i.e., permanently
associates, the assigned sequence number to the data message and informs the entire group about the
commitment.
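
The four steps above can be sketched from the sequencer's point of view as a small Python state machine. This is a minimal illustration under stated assumptions, not the patented implementation: the class name Sequencer, the method names, and the parameters report_quorum and ack_quorum (standing in for the two "predetermined numbers") are hypothetical, and all network transport is replaced by direct method calls.

```python
class Sequencer:
    """Toy single-process model of the sequencer in the four steps above."""

    def __init__(self, report_quorum, ack_quorum):
        self.report_quorum = report_quorum  # requests needed before assigning
        self.ack_quorum = ack_quorum        # acknowledgments needed to commit
        self.next_seq = 1
        self.reports = {}    # message ID -> set of reporting data servers
        self.assigned = {}   # message ID -> assigned sequence number
        self.acks = {}       # message ID -> set of acknowledging commit servers
        self.committed = {}  # message ID -> committed sequence number

    def on_data_report(self, mid, data_server):
        # Steps 1-2: count distinct data servers asking for a number.
        if mid in self.assigned:
            return None
        self.reports.setdefault(mid, set()).add(data_server)
        if len(self.reports[mid]) < self.report_quorum:
            return None
        self.assigned[mid] = self.next_seq
        self.next_seq += 1
        return self.assigned[mid]   # would be submitted to the commit servers

    def on_commit_ack(self, mid, commit_server):
        # Steps 3-4: count distinct commit servers, then commit.
        if mid not in self.assigned or mid in self.committed:
            return None
        self.acks.setdefault(mid, set()).add(commit_server)
        if len(self.acks[mid]) < self.ack_quorum:
            return None
        self.committed[mid] = self.assigned[mid]
        return self.committed[mid]  # would be announced to the entire group
```

Using sets of server identifiers means that a repeated report or acknowledgment from the same server does not count twice toward the threshold, which is one plausible reading of "a predetermined number of requests."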

BRIEF DESCRIPTION OF THE DRAWINGS 

Figures 1(a) and 1(b) illustrate one implementation of the SAM protocol.
Figure 2 shows delivery of a data message in SAM. 

DETAILED DESCRIPTION OF THE INVENTION 

In the remainder of this specification, we assume that the underlying communication layer 
provides support for asynchronous, unreliable multicast communication. When multicast is not 
available, it can be easily simulated with a series of unicast transmissions. Although the underlying 
communication need not be reliable, the protocol can benefit from any technique that increases 
reliability, such as forward error correction (FEC). Similarly, any multicast congestion control 
technique can be easily incorporated in the system. We do not, however, consider multicast 
congestion control in this work.

SAM supports an open multicast group model. In other words, any process can send
messages to the multicast group. This is different from some systems, where only the members of
the group are allowed to send messages to the group.

The following discussion also assumes that each process in the system has a unique process
identifier (PID), and that each message sent to the group has a unique message ID (MID), but these
qualities are not strictly necessary.

In the SAM architecture, each process takes one of the following five roles:

Sequencer. At any time, there is a single sequencer in the system. The sequencer serializes
data messages sent to the group by assigning a unique sequence number to each message. The
sequencer assigns sequence numbers sequentially, the increasing (progressing) order of sequence
numbers forming a total ordering over all messages sent to the group. As long as all members of the
multicast group deliver the messages based on their sequence numbers, they will deliver messages
in the same order.

Commit Server. Commit servers store the ordering information for each message, i.e., the
<message ID, sequence number> pair. Moreover, if the sequencer goes down, one of the commit
servers will take over as the new sequencer.

Data Server. Data servers store the data messages sent to the group. Their main function
is to support persistent message delivery, i.e., providing old data messages to the group members.
This capability is important when a new member joins the group, and when re-delivery of a message
is required after the original message delivery to an existing member fails.

Checkpoint Server. Checkpoint servers consolidate messages.

Client. All other members of the group are clients.

Note that these are logical entities only. They are introduced for clarity of description.
Different types of processes need not be physically separate - a single process can perform different
functions.

Clients may be distributed over a wide area. On the other hand, it is preferable that all
servers - the sequencer, commit servers, data servers, and checkpoint servers - be located on the
same LAN/SAN. We can easily achieve this by using a cluster for the servers.

We further assume four multicast channels in the system: one channel (GLOBALCHAN) for
the entire group, another channel (DATACHAN) for all data servers, the third channel
(COMMITCHAN) for all commit servers, and the fourth channel (CHKPNTCHAN) for all
checkpoint servers. Again, these logically different channels need not be physically distinct. Since
we assume that all servers are on the same LAN/SAN, we can have a single multicast channel for
all servers, rather than having multiple server channels.

To improve the basic five-step method described immediately above, we use a
receiver-driven, negative acknowledgment (NACK) based approach.

Clients are required to deliver messages to upper layer applications in order of increasing
sequence numbers. Since sequence numbers are assigned sequentially, any gap in the sequence
numbers indicates that some messages are missing. The sequencer keeps assigning new sequence
numbers and informing the entire group of these numbers. Moreover, the sequencer periodically
sends out gossip (heartbeat) messages that contain the largest sequence number in the system. In
this way, the clients can easily detect non-receipt of data and Commit messages. (Unless otherwise
indicated, we do not distinguish between a data message that has not been received by a client and
a received data message without a corresponding Commit message; for convenience, we generally
refer to such messages as "missing.")
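
The gap detection described above can be sketched as follows in Python. The sketch assumes that sequence numbers start at 1 and that a message is handed to the client only once both its data and its Commit message are known; the class ClientDelivery and its method names are hypothetical and serve only to illustrate in-order delivery with a hold-back buffer.

```python
class ClientDelivery:
    """Delivers committed messages in sequence order and reports gaps."""

    def __init__(self):
        self.next_to_deliver = 1   # assumed first sequence number
        self.highest_known = 0     # largest sequence number heard of so far
        self.ready = {}            # seq -> payload, received but held back
        self.delivered = []        # payloads passed to the upper layer

    def on_committed_message(self, seq, payload):
        self.highest_known = max(self.highest_known, seq)
        if seq >= self.next_to_deliver:
            self.ready[seq] = payload
        self._deliver_in_order()

    def on_gossip(self, largest_seq):
        # A heartbeat reveals sequence numbers the client never saw.
        self.highest_known = max(self.highest_known, largest_seq)

    def missing(self):
        # Every undelivered sequence number without a buffered message is a gap.
        return [s for s in range(self.next_to_deliver, self.highest_known + 1)
                if s not in self.ready]

    def _deliver_in_order(self):
        while self.next_to_deliver in self.ready:
            self.delivered.append(self.ready.pop(self.next_to_deliver))
            self.next_to_deliver += 1
```

A client that receives messages 1 and 3 delivers only message 1 and reports sequence number 2 as missing; a later gossip message carrying 5 extends the missing list to 2, 4, and 5 without any positive acknowledgment traffic.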

Once a client detects that a message is missing, it can obtain this message through the following
two-step procedure, as illustrated in Figure 2:

1. If the message ID is unknown, the client first sends a query to one of the commit
servers to find out the message ID corresponding to the missing sequence number; otherwise, go to
the next step.

2. After getting the message ID, the client first checks whether this data message has
already been received; if so, the client knows that the message has been committed to the missing
sequence number; otherwise, the client sends another query to one of the data servers to retrieve the
data message with the corresponding message ID.
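
The two-step procedure above can be sketched as a single Python function. For illustration only, the commit server and the data server are modeled as plain dictionaries rather than remote processes, and the function name recover and its parameters are hypothetical. The function also counts the queries it makes, to show when the second query is avoided.

```python
def recover(seq, known_mids, received, commit_server, data_server):
    """Obtain the message for a missing sequence number.

    known_mids:    seq -> message ID already known to the client
    received:      message ID -> payload already received by the client
    commit_server: seq -> message ID (stand-in for a commit server query)
    data_server:   message ID -> payload (stand-in for a data server query)
    Returns (message ID, payload, number of queries made).
    """
    queries = 0
    # Step 1: learn the message ID, asking a commit server only if needed.
    mid = known_mids.get(seq)
    if mid is None:
        mid = commit_server[seq]
        queries += 1
    # Step 2: fetch the data only if the client does not already hold it.
    payload = received.get(mid)
    if payload is None:
        payload = data_server[mid]
        queries += 1
    return mid, payload, queries
```

With the data already held locally, only the commit server is consulted and no data message is retransmitted; a client that holds neither piece of information needs both queries.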

Here we have chosen to decouple the query for the message ID from the query for the actual
data message. Therefore, the client has to do two queries in the worst case. Alternatively, we can
have the client do only one query and have the servers send the data message corresponding to the
missing sequence number to the client directly. The problem with the second approach is that it is
possible that the client has already received the data message; the only thing missing is the
sequencer's Commit message containing the sequence number of the data message. In this case, it
is wasteful to retransmit the message. When the group is large, such unnecessary overhead becomes
significant. On the other hand, if the client misses multiple messages, the overhead of extra queries
for the message ID can be amortized over the multiple missing messages by including multiple
queries in a single message.

For a protocol using a NACK-based approach to be truly reliable, a copy of each multicast 
data packet must be stored somewhere within the network. In SAM, senders no longer need to do 
this - the data servers store all the data messages sent to the group. The storage problem, however, 
still exists - it has merely shifted from the senders to the data servers. SAM solves the problem by 
supporting message consolidation. 

Many real world state synchronization problems resemble updating a table in an 
asynchronous, message-passing system. More specifically, the system state can be modeled as 
content of a table, with each process-group member keeping a local copy of the table; the data 
messages sent to the group operate on the table, querying or modifying the table's content. The goal 
is to maintain consistency among all local copies of the table. This is exactly the case in the router 
context we mentioned before. 

If the table-updating paradigm represents the state synchronization problem of the multicast 
group, the data servers need to store only the cumulative state of the system, rather than the entire 
history of operations on the table. Therefore, the data messages can be consolidated by 
checkpointing the system state periodically, for example every 50,000 messages with consecutive 
sequence numbers. We define the sequence number of a checkpoint as the maximum sequence
number among all messages consolidated by the checkpoint. To synchronize its upper level
application, a new client, or a client that became asynchronous, no longer needs to deliver all the 
data messages that have been sent to the multicast group; the client needs to retrieve and deliver only 
the latest checkpoint and all committed but not checkpointed data messages with sequence numbers 
larger than the sequence number of the checkpoint. All checkpointed data messages can thus be 
garbage collected. Moreover, no checkpoint server needs to be blocked during synchronization 
because the client requests each block of the checkpoint separately, though such logically separate 
requests may be grouped in the same query message. 
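
The consolidation idea can be sketched in Python under the table-updating paradigm described above. The sketch assumes, purely for illustration, that every data message is a (key, value) update to a dictionary; the function names are hypothetical, and block-wise retrieval of a checkpoint is not modeled.

```python
def apply_update(table, message):
    """Apply one data message, assumed here to be a (key, value) update."""
    key, value = message
    table[key] = value


def make_checkpoint(table, last_seq):
    # The checkpoint's sequence number is the largest one it consolidates.
    return {"seq": last_seq, "table": dict(table)}


def synchronize(checkpoint, committed_log):
    """Rebuild state from a checkpoint plus later committed messages.

    committed_log maps sequence number -> data message.
    """
    table = dict(checkpoint["table"])
    for seq in sorted(s for s in committed_log if s > checkpoint["seq"]):
        apply_update(table, committed_log[seq])
    return table


def garbage_collect(committed_log, checkpoint):
    # Messages already covered by the checkpoint no longer need storing.
    return {s: m for s, m in committed_log.items() if s > checkpoint["seq"]}
```

A new client that calls synchronize with the latest checkpoint and the garbage-collected log reaches the same table as a client that replayed the entire history, which is why the checkpointed messages can be discarded.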

The task of checkpointing the system state is performed by the checkpoint servers. A 
checkpoint server is just a client with some special upper level checkpointing application. Just like 
any other client, a checkpoint server needs to deliver a checkpoint and all the messages with higher 
sequence numbers to its upper layer applications. Here, the checkpointing application provides the 
following functionality specific to checkpointing: 

1. Make checkpoints periodically; that is, make a checkpoint every predetermined 
number of messages; the checkpoint period may be fixed and agreed upon by all such checkpoint 
applications; alternatively, the checkpoint period may be dynamic, determined by some consensus 
protocol.

2. Periodically send Checkpoint Reports, with checkpoint information, to the sequencer,
so that the sequencer can inform the entire group about the latest checkpoint; checkpoint information
includes the corresponding sequence number and the size of each checkpoint; with this information,
a client can request a specific block of a specific checkpoint from a checkpoint server.

3. Provide checkpoint data when requested.

To ensure reliability, the sequencer advertises information only about checkpoints replicated 
with sufficient redundancy, according to the user's requirements. That is, the sequencer advertises 
only those checkpoints that have been reported by a predetermined number of checkpoint servers. 

Before explaining our garbage collection algorithm, let us first define the concept of "logical 
timestamp" of a data message. The logical timestamp is defined as the largest sequence number that 
the sender-client knew about at the time when the message was sent for the first time. This concept 
therefore may not apply to data messages generated outside the multicast group. Every retransmitted 
data message bears the logical timestamp of the original message. Thus, logical timestamp indicates 
the sender's view of the current system state. 

To improve garbage collection, we also introduce the concept of "maximum logical lifetime". 
The sequencer assigns sequence numbers only to those messages whose logical timestamp is greater 
than [maximum sequence number assigned by the sequencer - maximum logical lifetime]. That is, 
any data message becomes garbage if it cannot be assigned a sequence number within the next
[maximum logical lifetime] sequence numbers assigned by the sequencer. 

In SAM, any information can be garbage collected if either of the following two conditions
holds:

1. The information has been checkpointed already; thus, if the logical timestamp of a
data message is older than the latest system checkpoint number, the data message has been
checkpointed and hence can be garbage collected.

2. The information is too old to be useful; for example, a data server can stop requesting
the sequencer to assign a sequence number to a data message if the logical timestamp of the data
message becomes less than [the largest sequence number the data server has ever seen - maximum
logical lifetime]; this is because the sequencer will not assign a sequence number to the message if
it has not done so already.
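
The sequencer's admission rule and the two garbage collection conditions above can be written as simple predicates. The Python sketch below is a direct transcription of the inequalities in the text; the function and parameter names are hypothetical.

```python
def sequencer_accepts(logical_timestamp, max_assigned_seq, max_logical_lifetime):
    """True if the sequencer may still assign a sequence number."""
    return logical_timestamp > max_assigned_seq - max_logical_lifetime


def data_server_may_drop(logical_timestamp, largest_seq_seen,
                         max_logical_lifetime, latest_checkpoint_seq):
    """True if either garbage collection condition holds for a data message."""
    # Condition 1: older than the latest system checkpoint number.
    checkpointed = logical_timestamp < latest_checkpoint_seq
    # Condition 2: too old for the sequencer to ever assign a number.
    expired = logical_timestamp < largest_seq_seen - max_logical_lifetime
    return checkpointed or expired
```

For example, with a maximum logical lifetime of 100 and a largest assigned sequence number of 1000, a message with logical timestamp 950 is still eligible for a sequence number, while one with logical timestamp 850 is not and may be dropped by the data servers.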

It may be advisable to keep the information longer than strictly necessary, in order to allow
a client that is missing a data message to retrieve it from a data server directly, rather than resort to
the two-step synchronization process.

Recovery from machine crashes is easy with SAM. Because the system generally has
redundant data, commit, and checkpoint servers, no information is lost when one of the servers
crashes. If a client crashes, it can recover by synchronizing itself, as described above.

Recovery from sequencer failure, however, requires a different procedure. First, a new
sequencer is selected from among the commit servers. This can be achieved using any well-known
distributed leader election algorithm.

Second, the new sequencer finds out the maximum sequence number in the system. The
preferred way of achieving this is by using stable storage. Information stored in stable storage must
be unaffected by sequencer failure. Whenever the sequencer assigns a new sequence number, it
always first writes the maximum sequence number to the stable storage. In this way, the newly
elected sequencer can immediately find out the maximum sequence number by reading the stable
storage. Note that we only need to store the maximum sequence number in stable storage; hence,
a very small amount of stable storage is required. Alternatively, the new sequencer can query all
commit servers for the highest committed sequence number.
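
The write-before-assign discipline described above can be sketched in Python. The sketch stands in a local file for stable storage, which is an assumption made only for illustration; in a real deployment the storage would have to survive failure of the sequencer's machine. The class names StableCounter and RecoverableSequencer are hypothetical.

```python
import os


class StableCounter:
    """Stores one integer durably in a file (stand-in for stable storage)."""

    def __init__(self, path):
        self.path = path

    def read_max(self):
        try:
            with open(self.path) as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0   # nothing assigned yet

    def write_max(self, value):
        # Write to a temporary file, flush to disk, then rename atomically.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(value))
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)


class RecoverableSequencer:
    def __init__(self, storage):
        self.storage = storage
        # A newly elected sequencer starts from the stored maximum.
        self.max_seq = storage.read_max()

    def assign(self):
        new_seq = self.max_seq + 1
        self.storage.write_max(new_seq)   # persist before using the number
        self.max_seq = new_seq
        return new_seq
```

Because the maximum is persisted before a number is handed out, a replacement sequencer constructed over the same storage continues after the last value that could have been assigned and never reuses a sequence number.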

Third, the new sequencer gets the history for the last [maximum logical lifetime] assignments
of sequence numbers. Knowledge of the history is necessary to ensure that the sequencer does not 
assign sequence numbers to the messages that already have sequence numbers assigned to them. 
This does not mean that the newly elected sequencer cannot assign new sequence numbers until it 
obtains the entire requisite history; the new sequencer can assign a sequence number to any data 
packet as long as it knows that this has not been done before; therefore, if the assignments of all 
sequence numbers greater than the logical timestamp of the data message are known, the sequencer 
can assign a sequence number to it. 

SAM is a robust algorithm providing reliable data message delivery. It can be designed to 
withstand numerous failures. As with any fault-tolerant system, this is achieved through data 
replication. Here, we give a brief summary of SAM's data replication provisions discussed above 
in various contexts. 

Given a common redundancy requirement k, the data replication algorithm is as follows:

1. The sequencer assigns sequence numbers only to messages that have been reported
by at least k different data servers. This ensures that any message that has been assigned a sequence
number is replicated at least k times.

2. Before the sequencer informs the entire group about commitment of a sequence
number to a specific message, the sequencer must first receive acknowledgments from at least k
different commit servers. This ensures that the sequence number information is replicated at least
k times.

3. The sequencer advertises checkpoint information only about checkpoints reported by
at least k checkpoint servers. This ensures that every checkpoint known to the entire group is
replicated at least k times.

By replicating information at least k times, SAM can withstand up to (k-1) failures among
data servers, checkpoint servers, and commit servers, respectively. And since we support persistent
message delivery, SAM can withstand any number of failures among clients. The redundancy
requirement of course need not be the same for the data, commit, and checkpoint servers.

With SAM, the sequencer needs to wait for messages from only k servers each time, instead
of waiting for messages from all group members, as in Isis and Horus. The number of servers is
generally very small compared to the number of clients. And it is possible to make the servers
highly reliable, for example, by using special hardware. So k can be made very small.
Consequently, we no longer have to cater to the slowest member of the group, but only wait for k
responses from k servers located on the same LAN/SAN.

The actual reliability can be much higher than is implied by server redundancy and data
replication. This is because we can make the communication channels among servers - 
DATACHAN, COMMITCHAN, and CHKPNTCHAN - also highly reliable. Such highly reliable 
servers connected by highly reliable communication channels can drastically improve performance 
when the network suffers from packet loss, or when the group membership is highly dynamic. 
Furthermore, since clients generally request data from only one of the servers, multiple servers can 
work in parallel, making the system even more scalable. This is not the case in Isis and Horus, 
where increasing group size impairs performance. 

As a final remark, note that knowledge of group membership is almost unnecessary in SAM. 
Of course, such knowledge is needed at lower layers to provide multicast services; but there the cost 
is relatively low and working protocols such as IGMP manage group membership efficiently. 

While the features of the invention have been described and illustrated with some specificity, 
it will be understood by those skilled in the art that changes in the above description or illustration 
may be made with respect to form or detail without departing from the spirit and scope of the 
invention. 

Having thus described the invention:

WE CLAIM:

1. A method for multicasting data messages to members of a multicast group, the multicast
group comprising a sequencer, one or more clients, one or more data servers, and one or
more commit servers, the method comprising the steps of:

transmitting a first data message to the members of the multicast group; 
each data server that receives the first data message requesting the sequencer 
to assign a first sequence number to the first data message, the first sequence number 
being from a sequence of numbers allocated to the data messages, said first sequence 
number following all sequence numbers assigned prior to assignment of the first 
sequence number; 

assigning the first sequence number to the first data message, in response to 
the sequencer receiving a first quantity of the requests to assign a first sequence 
number to the first data message; 

notifying the commit servers of the assignment of the first sequence number 
to the first data message; 

each of the commit servers sending to the sequencer an acknowledgment of 
the notification of the assignment of the first sequence number to the first data 
message, in response to being notified of the assignment of the first sequence number 
to the first data message; 




committing the first sequence number to the first data message, in response 
to the sequencer receiving a second quantity of the acknowledgments of the 
notification of the assignment of the first sequence number to the first data message; 
and 

informing the members of the multicast group of the commitment of the first 
sequence number to the first data message. 

2. A method according to Claim 1, wherein

said step of each data server that receives the first data message requesting the 
sequencer to assign a first sequence number to the first data message includes the 
step of sending, from said each data server that receives the first data message to the 
sequencer, a data report message identifying the first data message; 

said step of notifying the commit servers of the assignment of the first 
sequence number includes the step of submitting to the commit servers a commit 
submit message identifying the first data message; 

said step of sending to the sequencer an acknowledgment of the notification 
of the assignment of the first sequence number includes the step of sending to the 
sequencer a commit acknowledge message identifying the first data message; and

said step of informing the members of the multicast group of the commitment
of the first sequence number includes the step of sending a commit message 
identifying the first data message to the members of the multicast group. 



A method according to Claim 2, 

further comprising the step of transmitting a second data message to the 
members of the multicast group; 

wherein said step of sending, from said each data server that receives the first 
data message to the sequencer, a data report message identifying the first data 
message further includes the step of a first data server sending a first data report 
message identifying the first data message to the sequencer after said first data server 
receives the second data message, said first data report message also identifying the 
second data message. 

A method according to Claim 2, further comprising the steps of: 

transmitting a second data message to the members of the multicast group; 
each data server that receives the second data message requesting the 
sequencer to assign a second sequence number, the second sequence number being 
from the sequence of numbers allocated to the data messages, said second sequence
number following all sequence numbers assigned prior to assignment of the second 
sequence number, to the second data message, said step of each data server that 
receives the second data message requesting the sequencer to assign a second 
sequence number to the second data message, includes the step of sending from said 
each data server that receives the second data message to the sequencer a data report 
message identifying the second data message; 

assigning the second sequence number to the second data message, in 
response to the sequencer receiving a third quantity of the requests to assign a second 
sequence number to the second data message; 

wherein said step of notifying the commit servers of the assignment of the first 
sequence number further includes the step of notifying the commit servers of the assignment 
of the second sequence number, said commit submit message identifying the first data 
message also identifying the second data message. 

A method according to Claim 2, wherein the members of the multicast group deliver the 
data messages to their respective upper layer applications in order of progressing sequence 
numbers, further including the step of using a receiver driven, negative acknowledgment- 
based approach to improve reliability of delivery of the data messages. 

A method as in any one of Claims 1-5, wherein said data servers store said data messages
transmitted to the multicast group, the multicast group further comprising checkpoint 
servers, the method further including the steps of: 

step for message consolidation; 

step for garbage collection; and 

step for storing said first sequence number in stable storage. 

A method for processing data messages multicast to members of a multicast group, the 
multicast group comprising a sequencer, one or more clients, one or more data servers, and 
one or more commit servers, the method comprising the steps of:

each data server that receives said each data message requesting the sequencer 
to assign a sequence number, from a sequence of numbers allocated to the data 
messages, to said each data message, in response to receiving each data message; 

assigning a sequence number following all sequence numbers assigned prior 
to assignment of the sequence number to said each data message, in response to the 
sequencer receiving a first quantity of requests to assign a sequence number to said 
each data message; 

notifying the commit servers of each assignment, each notification identifying 
said each assignment by said each data message and the sequence number assigned 
to said each data message; 

each of the commit servers sending to the sequencer an acknowledgment of
said each notification, in response to being notified of said each assignment, said 
acknowledgment identifying said each data message; 

committing said each assignment, in response to the sequencer receiving a 
second quantity of the acknowledgments identifying said each data message; and 

informing the members of the multicast group of each commitment. 

A method according to Claim 7, wherein 

the members of the multicast group deliver the data messages to their 
respective upper layer applications in order of progressing sequence numbers; 

said data servers store said data messages transmitted to the multicast group; 

further including the step of using a receiver driven, negative 
acknowledgment-based approach to improve reliability of delivery of the data 
messages. 

A method according to Claim 8, wherein said each data message is associated with a 
unique message ID and is identifiable from its associated message ID, the step of using 
further includes the steps of: 

each member of the multicast group identifying gaps in a progression of
sequence numbers known by said each member of the multicast group to have been 
committed to data messages received by said each member of the multicast group; 

if said each member of the multicast group does not know a first message ID, 
said first message ID being associated with a first data message, a first sequence 
number within one of said gaps having been previously committed to said first data 
message, said each member of the multicast group querying one of said commit 
servers to obtain said first message ID; and 

if said each member of the multicast group has not received said first data 
message, querying one of said data servers to retrieve said first data message. 
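
By way of illustration only, the receiver-driven recovery recited above may be sketched as follows. The function names, the use of dictionaries for a member's local knowledge, and the two query callables standing in for requests to a commit server and a data server are all assumptions made for this sketch.

```python
def find_gaps(known_committed):
    """Return sequence numbers missing between the lowest and highest known ones."""
    if not known_committed:
        return []
    known = set(known_committed)
    return [s for s in range(min(known), max(known) + 1) if s not in known]


def recover(gaps, seq_to_id, received, query_commit_server, query_data_server):
    """Fill each gap: learn the message ID if unknown, then fetch the message if missing.

    seq_to_id maps sequence number -> message ID known to this member.
    received maps message ID -> message payload held by this member.
    """
    for seq in gaps:
        if seq not in seq_to_id:
            seq_to_id[seq] = query_commit_server(seq)   # hypothetical query
        msg_id = seq_to_id[seq]
        if msg_id not in received:
            received[msg_id] = query_data_server(msg_id)  # hypothetical query
    return received
```

A member that knows commitments for sequence numbers 1, 2 and 5 would compute the gap as 3 and 4, and would contact a commit server only for those numbers whose message ID it has not yet learned.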

A method according to Claim 8, wherein said each data message is associated with a 
unique message ID and is identifiable from its associated message ID, the step of using
further includes the steps of: 

each member of the multicast group identifying gaps in a progression of 
sequence numbers known by said each member of the multicast group to have been 
committed to data messages received by said each member of the multicast group; 

said each member of the multicast group querying one of said data servers to 
retrieve said first data message. 

A method according to Claim 10, further comprising the step of said sequencer
periodically generating and sending heartbeat messages to the members of the multicast 
group, each said heartbeat message containing an associated largest sequence number, said 
associated largest sequence number being the last sequence number committed at a time 
substantially equal to a time said heartbeat message is generated. 
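
By way of illustration only, one plausible use of such a heartbeat, assumed here rather than recited in the claim, is to let a member detect commitments it has missed at the tail of the sequence, where no later message would otherwise reveal the loss. The function name and its parameters are hypothetical.

```python
def tail_gap(last_committed_known, heartbeat_largest_seq):
    """Sequence numbers the heartbeat reports as committed but this member lacks."""
    return list(range(last_committed_known + 1, heartbeat_largest_seq + 1))
```

A member that has seen commitments through sequence number 7 and receives a heartbeat carrying 10 would, under this assumption, request 8 through 10.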

A method according to Claim 8, further comprising the step for periodic message 
consolidation. 

A method according to Claim 8, wherein the multicast group further comprises one or 
more checkpoint servers, the method further comprising the step of performing periodic 
message consolidation by said checkpoint servers at message intervals determined through 
a common consensus protocol, each message consolidation producing a checkpoint 
associated with said each message consolidation, said checkpoint associated with said each 
message consolidation corresponding to a terminal data message, said checkpoint associated 
with said each message consolidation containing checkpoint information, the checkpoint 
information being sufficient for a first upper layer application of said upper layer applications 
to reconstruct a cumulative system state said first upper layer application would attain upon 
receiving said terminal message and all said data messages that preceded said terminal 
message. 
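
By way of illustration only, message consolidation into a checkpoint may be sketched as follows. The sketch assumes, purely for concreteness, that each data message is a key/value update and that the cumulative system state is a dictionary; the claim does not require either. The function names and the checkpoint layout are likewise hypothetical.

```python
def consolidate(messages, terminal_seq):
    """Fold every update with sequence number <= terminal_seq into one checkpoint.

    messages maps sequence number -> (key, value) update.
    """
    state = {}
    for seq in sorted(messages):
        if seq <= terminal_seq:
            key, value = messages[seq]
            state[key] = value
    return {"terminal_seq": terminal_seq, "state": state}


def catch_up(checkpoint, messages):
    """Rebuild the current state from a checkpoint plus the messages that follow it."""
    state = dict(checkpoint["state"])
    for seq in sorted(messages):
        if seq > checkpoint["terminal_seq"]:
            key, value = messages[seq]
            state[key] = value
    return state
```

Under these assumptions, a member that loads the checkpoint and then applies only the later messages reaches the same state as a member that applied every message in sequence order.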

A method according to Claim 13, further comprising the step of said checkpoint servers
periodically generating and sending checkpoint reports to said sequencer, each checkpoint
report corresponding to the latest checkpoint at the time said each checkpoint report is generated,
said each checkpoint report identifying a sequence number of its corresponding terminal
data message, said each checkpoint report carrying size data of the latest checkpoint.

A method according to Claim 14, further comprising step for synchronizing a first 
asynchronous upper layer process of a first asynchronous member of the multicast group 
with other members of the multicast group, said first asynchronous member not being said 
sequencer or one of said data or commit servers. 

A method according to Claim 14, further comprising the step of synchronizing a first 
asynchronous upper layer process of a first asynchronous member of the multicast group 
with other members of the multicast group, said first asynchronous member not being said 
sequencer or one of said data or commit servers, said synchronizing step including the steps 
of: 

said first asynchronous member retrieving a first checkpoint from said 
checkpoint servers; 

A method according to Claim 16, wherein said each data message bears a corresponding
logical timestamp, said logical timestamp including a most recent sequence number known
to the original sender of said each data message when said each data message was first sent.



A method according to Claim 18, further comprising the step of:

the data servers deleting said stored messages that have logical checkpoints 
older by a maximum logical lifetime number at the time of deletion than a most recent 
sequence number known at the time of deletion. 
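
By way of illustration only, the deletion step recited above may be sketched as follows. The function name, the storage layout, and the choice to delete a message exactly when its age reaches the maximum logical lifetime (rather than only when it exceeds it) are assumptions made for this sketch.

```python
def collect_garbage(stored, latest_seq, max_logical_lifetime):
    """Return the stored messages that survive garbage collection.

    stored maps message ID -> (logical_timestamp, payload), where the logical
    timestamp is the most recent sequence number the sender knew when sending.
    A message is dropped once latest_seq - logical_timestamp reaches the lifetime.
    """
    return {
        msg_id: (timestamp, payload)
        for msg_id, (timestamp, payload) in stored.items()
        if latest_seq - timestamp < max_logical_lifetime
    }
```

With a most recent known sequence number of 10 and a maximum logical lifetime of 5, a message stamped 1 would be deleted and a message stamped 8 would be retained.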

A method according to Claim 14, further comprising the step of: 

said data servers deleting the stored data messages that are older than the 
latest checkpoint. 

A method according to Claim 16, wherein the multicast group further includes stable 
storage writeable by said sequencer, said method further comprising the step of said 
sequencer storing in said stable storage said assigned sequence number before said step of 
notifying the commit servers. 

A method according to Claim 8, wherein the multicast group further includes stable
storage writeable by said sequencer, said method further comprising the step of said 
sequencer storing in said stable storage said assigned sequence number before said step of 
notifying the commit servers. 
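
By way of illustration only, the ordering recited above (store the assignment in stable storage, then notify) may be sketched as follows. The sketch assumes that stable storage is an append-only local file forced to disk with `os.fsync`; the function name, the file format, and the `notify_commit_servers` callable are hypothetical.

```python
import os


def assign_durably(log_path, msg_id, seq, notify_commit_servers):
    """Append the assignment to a log file and force it to disk before notifying."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{seq}\t{msg_id}\n")
        log.flush()                 # push Python's buffer to the operating system
        os.fsync(log.fileno())      # ask the operating system to write it to disk
    # Only after the record is durable are the commit servers notified.
    notify_commit_servers(msg_id, seq)
```

Under this assumption, a sequencer that restarts after a failure could read the log to learn which sequence numbers it had already assigned.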

ABSTRACT

This document describes a protocol for reliably synchronizing states of nodes in a distributed
environment through use of a Scalable Atomic Multicast (SAM) service that ensures both atomicity
and total order among messages sent to a multicast group. In addition to scaling well,
this fault-tolerant protocol does not require explicit knowledge of multicast group
membership, allows for non-disturbing state synchronization, and supports asynchronous
non-blocking communications. According to one aspect of this invention, a dedicated sequencer is
responsible solely for assigning sequence numbers to the multicast messages. The sequencer does
not multicast the messages. Another aspect of the invention is the use of receiver-driven negative
acknowledgments. According to a third aspect, the invention supports message consolidation and
garbage collection.

IN THE UNITED STATES

PATENT AND TRADEMARK OFFICE 

Declaration and Power of Attorney 
As the below named inventor, I hereby declare that: 

My residence, post office address and citizenship are as stated below next to my name. 

I hereby claim the benefit under Title 35, United States Code, 119(e) of any United States provisional
application(s) identified below:

Provisional application No. 60/098,065, filed on August 27, 1998

I believe I am the original, first and sole inventor of the subject matter which is claimed and for which a 
patent is sought on the invention entitled SCALABLE ATOMIC MULTICAST (SAM) the specification of which 
is attached hereto. 

I hereby state that I have reviewed and understand the contents of the above identified specification, 
including the claims, as amended by an amendment, if any, specifically referred to in this oath or declaration. 

I acknowledge the duty to disclose all information known to me which is material to patentability as 
defined in Title 37, Code of Federal Regulations, 1.56. 

I hereby claim foreign priority benefits under Title 35, United States Code, 119 of any foreign 
application(s) for patent or inventor's certificate listed below and have also identified below any foreign application
for patent or inventor's certificate having a filing date before that of the application on which priority is claimed: 

None 

I hereby claim the benefit under Title 35, United States Code, 120 of any United States application(s) listed
below and, insofar as the subject matter of each of the claims of this application is not disclosed in the prior United 
States application in the manner provided by the first paragraph of Title 35, United States Code, 112, I 
acknowledge the duty to disclose all information known to me to be material to patentability as defined in Title 37, 
Code of Federal Regulations, 1.56 which became available between the filing date of the prior application and the 
national or PCT international filing date of this application: 

None 

I hereby declare that all statements made herein of my own knowledge are true and that all statements 
made on information and belief are believed to be true; and further that these statements were made with the 
knowledge that willful false statements and the like so made are punishable by fine or imprisonment, or both, under 
Section 1001 of Title 18 of the United States Code and that such willful false statements may jeopardize the validity 
of the application or any patent issued thereon. 

I hereby appoint the following attorney(s) with full power of substitution and revocation, to prosecute said
application, to make alterations and amendments therein, to receive the patent, and to transact all business in the 
Patent and Trademark Office connected therewith: 



Lester H. Birnbaum (Reg. No. 25830)
Richard J. Botos (Reg. No. 32016)
Jeffery J. Brosemer (Reg. No. 36096)
Kenneth M. Brown (Reg. No. 37590)
Craig J. Cox (Reg. No. 39643)
Donald P. Dinella (Reg. No. 39961)
Guy Eriksen (Reg. No. 41736)
Martin I. Finston (Reg. No. 31613)
James H. Fox (Reg. No. 29379)
William S. Francos (Reg. No. 38456)
Barry H. Freedman (Reg. No. 26166)
Julio A. Garceran (Reg. No. 37138)
Mony R. Ghose (Reg. No. 38159)
Jimmy Goo (Reg. No. 36528)
Anthony Grillo (Reg. No. 36535)
Stephen M. Gurey (Reg. No. 27336)
John M. Harman (Reg. No. 38173)
Michael B. Johannesen (Reg. No. 35557)
Mark A. Kurisko (Reg. No. 38944)
Irena Lager (Reg. No. 39260)
Christopher N. Malvone (Reg. No. 34866)
Scott W. McLellan (Reg. No. 30776)
Martin G. Meder (Reg. No. 34674)
John C. Moran (Reg. No. 30782)
Michael A. Morra (Reg. No. 28975)
Gregory J. Murgia (Reg. No. 41209)
Claude R. Narcisse (Reg. No. 38979)
Joseph J. Opalach (Reg. No. 36229)
Neil R. Ormos (Reg. No. 35309)
Eugen E. Pacher (Reg. No. 29964)
Jack R. Penrod (Reg. No. 31864)
Daniel J. Piotrowski (Reg. No. 42079)
Gregory C. Ranieri (Reg. No. 29695)
Scott J. Rittman (Reg. No. 39010)
Eugene J. Rosenthal (Reg. No. 36658)
Bruce S. Schneider (Reg. No. 27949)
Ronald D. Slusky (Reg. No. 26585)
David L. Smith (Reg. No. 30592)
Patricia A. Verlangieri (Reg. No. 42201)
John P. Veschi (Reg. No. 39058)
David Volejnicek (Reg. No. 29355)
Charles L. Warren (Reg. No. 27407)
Jeffrey M. Weinick (Reg. No. 36304)
Eli Weiss (Reg. No. 17765)



I hereby appoint the attorney(s) on ATTACHMENT A as associate attorney(s) in the aforementioned 
application, with full power solely to prosecute said application, to make alterations and amendments therein, to 
receive the patent, and to transact all business in the Patent and Trademark Office connected with the prosecution 
of said application. No other powers are granted to such associate attorney(s) and such associate attorney(s) are 
specifically denied any power of substitution or revocation.



Full name of sole inventor (or 1st joint inventor): Hong-Yi Tzeng

Inventor's signature: [signature] Date:

Residence: Tinton Falls, NJ 07712 
Citizenship: USA 

Post Office Address: 14 Lakeview Drive, Monmouth, NJ 07712 



Full name of 2nd joint inventor: Yin Zhang 

Inventor's signature: [signature] Date: [illegible]

Residence: Berkeley, CA 
Citizenship: China 

Post Office Address: 2524 Milvia Street, Berkeley, CA 94704 

ATTACHMENT A



Attorney Name(s): Keith D. Nowak Reg. No.: 21,361 



Telephone calls should be made to Keith D. Nowak, Lieberman & Nowak, LLP at:
Phone No.: (212) 532-4447
Fax No.: (212) 481-0543 

All written communications are to be addressed to: 

Keith D. Nowak 
Lieberman & Nowak, LLP 
805 Third Avenue 
New York, NY 10022 



